Comprehensive Weather Forecast Model Comparison: Chattanooga T2M

An XGBoost Approach to Multi-Day Temperature Forecasting

1. Introduction and Model Objectives

This document presents a comparative analysis of different XGBoost modeling strategies for forecasting daily mean temperature (T2M) in Chattanooga, TN, for 1 to 5 days ahead. Our primary objective is to evaluate the impact of various feature engineering techniques and model architectures on forecast accuracy.

We will compare four distinct models:

  • Bare Bones Baseline: A minimalist model using only T2M lags and basic time features.
  • Full Features (Non-Cascading): Utilizes all engineered features (derived, intra-city lags, cross-city lags, rolling windows) but trains independent models for each forecast horizon.
  • Cascading Forecasts: Employs the full feature set and a sequential training strategy, where predictions from earlier horizons are fed as features to later horizons.
  • Native Multi-Output: Uses the full feature set with XGBoost’s native multi-output tree strategy, aiming to learn interdependencies between all horizons in a single, unified model.
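The first three strategies all reduce to "one regressor per horizon"; only the native multi-output model fits a single estimator for all five targets jointly. The per-horizon idea behind MultiOutputRegressor can be sketched in plain NumPy (a minimal illustration on synthetic data; ordinary least squares stands in for XGBRegressor):

```python
import numpy as np

class PerTargetRegressor:
    """Fit one independent linear model per target column --
    the same idea MultiOutputRegressor applies to XGBRegressor."""

    def fit(self, X, Y):
        # One least-squares coefficient vector per forecast horizon
        Xb = np.column_stack([X, np.ones(len(X))])  # add intercept column
        self.coefs_ = [np.linalg.lstsq(Xb, Y[:, j], rcond=None)[0]
                       for j in range(Y.shape[1])]
        return self

    def predict(self, X):
        Xb = np.column_stack([X, np.ones(len(X))])
        return np.column_stack([Xb @ c for c in self.coefs_])

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))                                           # 4 features
Y = np.column_stack([X @ rng.normal(size=4) + h for h in range(1, 6)])  # 5 horizons

model = PerTargetRegressor().fit(X, Y)
preds = model.predict(X)   # one column of predictions per horizon
```

Each horizon gets its own fitted model, so nothing learned for day 1 informs day 5; the cascading and native multi-output strategies below are two different attempts to remove that limitation.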

2. Common Data Loading and Feature Engineering Pipeline

This section defines the shared data acquisition and feature engineering steps that are applied consistently to all models for a fair, “apples-to-apples” comparison. This includes loading data for Chattanooga and all surrounding cities, calculating derived features, intra-city lags, cross-city lags, and rolling window features.
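The core of this pipeline — lags, rolling windows, and shifted targets — can be sketched with pandas `shift` and `rolling` (a minimal illustration on a synthetic series; column names follow the conventions used below, e.g. `t2m_lag_1`, `t2m_roll_mean_7d`, `t2m_1_day_later`):

```python
import numpy as np
import pandas as pd

# Synthetic daily T2M series standing in for the NASA POWER data
dates = pd.date_range("2020-01-01", periods=30, freq="D")
df = pd.DataFrame({"date": dates, "t2m": 15 + 10 * np.sin(np.arange(30) / 5.0)})

# Intra-city lags: yesterday's value, the day before, ...
for lag in range(1, 8):
    df[f"t2m_lag_{lag}"] = df["t2m"].shift(lag)

# Rolling-window features (mean of the trailing 3 and 7 days)
for window in (3, 7):
    df[f"t2m_roll_mean_{window}d"] = df["t2m"].rolling(window).mean()

# Targets: t2m shifted backwards, one column per forecast horizon
for h in range(1, 6):
    suffix = "day" if h == 1 else "days"
    df[f"t2m_{h}_{suffix}_later"] = df["t2m"].shift(-h)

# Rows made NaN by shifting/rolling are dropped before training
df = df.dropna().reset_index(drop=True)
```

The leading rows (no lag history) and the trailing rows (no future target) are lost to NaNs, which is exactly why the full dataset below loses 45 rows after engineering.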

Data

This experiment uses daily data from the NASA POWER database, accessed through its API. Data from nine surrounding cities was also used: Scottsboro, Jasper, Winchester, Cleveland, Dalton, Athens, Fort Payne, Dayton, and Ringgold.

Train window: 2000-01-01 → 2022-12-31
Test window: 2023-01-01 → 2024-12-31

Dropped 45 rows due to NaNs after all feature engineering.
Final data shape after engineering & dropna: 9087 rows, 273 columns

Train rows: 8361
Test rows: 726
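Before the 45-row dropna there are 9,132 daily rows in total (8,401 train + 731 test); the dropped rows are presumably the leading train rows lost to lag/rolling NaNs and the trailing test rows lost to target NaNs, which reconciles with the 8,361/726 counts above. The split itself is a simple date mask (a sketch on a stand-in frame):

```python
import numpy as np
import pandas as pd

# Stand-in daily frame covering both windows
dates = pd.date_range("2000-01-01", "2024-12-31", freq="D")
df = pd.DataFrame({
    "date": dates,
    "t2m": 15.0 + 10 * np.sin(2 * np.pi * dates.dayofyear / 365.25),
})

# Chronological split: everything through 2022 trains, 2023-2024 tests
train_df = df[df["date"] <= "2022-12-31"].copy()
test_df = df[df["date"] >= "2023-01-01"].copy()
```

A chronological split (rather than a random one) matters here: shuffling would leak near-duplicate neighboring days between train and test and inflate every metric.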
Show code
import plotly.express as px 
import pandas as pd 

# Combine train and test data for the context plot
context_plot_df = pd.concat([
    train_df[['date', 't2m']].assign(dataset='Train Data'),
    test_df[['date', 't2m']].assign(dataset='Test Data')
], ignore_index=True)

train_context_fig = px.line(
    context_plot_df, x="date", y="t2m", color="dataset",
    title=f"Actual Mean Temperature for {target_city_name} (Train vs. Test Data)",
    template="plotly_white",
    color_discrete_map={'Train Data': 'blue', 'Test Data': 'red'}
)
train_context_fig.update_layout(height=400, xaxis_title="Date", yaxis_title="T2M Mean Temp (°C)")
train_context_fig.show()

Model Evaluation

3. Model 1: Bare Bones Baseline (04.000)

This model serves as a minimalist benchmark. It uses only t2m (the current day's mean temperature), its lags (1–7 days), and basic time-based features (date_ordinal, doy_sin, doy_cos) from Chattanooga; all other engineered features are excluded. It is trained using the MultiOutputRegressor strategy.
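The time-based features can be reproduced in a few lines (a sketch; the 365.25 period is an assumption, but any close value works): date_ordinal gives the model a monotone trend axis, while the sine/cosine pair encodes day-of-year seasonality without a discontinuity between Dec 31 and Jan 1.

```python
import numpy as np
import pandas as pd

dates = pd.date_range("2021-01-01", periods=10, freq="D")
df = pd.DataFrame({"date": dates})

# Monotone trend axis: days since year 1
df["date_ordinal"] = df["date"].map(pd.Timestamp.toordinal)

# Cyclical encoding of day-of-year: no jump at the year boundary
doy = df["date"].dt.dayofyear
df["doy_sin"] = np.sin(2 * np.pi * doy / 365.25)
df["doy_cos"] = np.cos(2 * np.pi * doy / 365.25)
```

With a raw day-of-year feature, day 365 and day 1 would look maximally far apart; the (sin, cos) pair places them next to each other on a circle.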

Show code
from xgboost import XGBRegressor
from sklearn.multioutput import MultiOutputRegressor
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score, max_error
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns


# --- Define Bare Bones Model's specific feature set ---
bare_bones_feature_cols = ["date_ordinal", "doy_sin", "doy_cos", "t2m"]
for lag in range(1, 8): # t2m_lag_1 to t2m_lag_7
    bare_bones_feature_cols.append(f"t2m_lag_{lag}")

# --- SECTION 4. Train XGBoost Model (Bare Bones) ---
# Prepare arrays for multi-output training
iX_train_bare = train_df[bare_bones_feature_cols]
iy_train_bare = train_df[target_cols_all_horizons] # All 5 targets

xgb_base_model_bare = XGBRegressor(n_estimators=100, learning_rate=0.1, random_state=0)
multi_output_model_bare = MultiOutputRegressor(estimator=xgb_base_model_bare, n_jobs=-1)
print(f"\n--- Training Bare Bones Model for targets: {target_cols_all_horizons} ---")
multi_output_model_bare.fit(iX_train_bare, iy_train_bare)
models_bare = {"multi_output_model": multi_output_model_bare} 

# --- SECTION 5. Forecast & Evaluate (Bare Bones) ---
df_eval_bare = pd.DataFrame({"date": test_df["date"]})
X_test_bare = test_df[bare_bones_feature_cols]
y_pred_all_targets_bare = models_bare["multi_output_model"].predict(X_test_bare)

current_model_results = [] 

for i, target_col in enumerate(target_cols_all_horizons):
    df_eval_bare[f"y_true_{target_col}"] = test_df[target_col]
    df_eval_bare[f"y_pred_{target_col}"] = y_pred_all_targets_bare[:, i]

    y_true = df_eval_bare[f"y_true_{target_col}"]
    y_pred = df_eval_bare[f"y_pred_{target_col}"]

    rmse = np.sqrt(mean_squared_error(y_true, y_pred))
    mae  = mean_absolute_error(y_true, y_pred)
    r2 = r2_score(y_true, y_pred)
    max_err = max_error(y_true, y_pred)
    
    current_model_results.append({
        'Model': 'Bare Bones',
        'Horizon': target_col.replace('t2m_', '').replace('_later', '').replace('_', ' '),
        'RMSE': rmse,
        'MAE': mae,
        'R2': r2,
        'Max_Error': max_err
    })

all_model_comparison_results.extend(current_model_results)

print("\n--- Metrics for Bare Bones Model (Test Set) ---")
display(pd.DataFrame(current_model_results)) 

# --- SECTION 6.0 Plot Results (Bare Bones) ---
plot_df_list_bare = []
# Actual (today's mean temp) is sourced from df_eval_bare, so only the test period is plotted
plot_df_list_bare.append(df_eval_bare[["date", f"y_true_{target_cols_all_horizons[0]}"]].assign(series="Actual (Today's Mean Temp)").rename(columns={f"y_true_{target_cols_all_horizons[0]}": "temperature"}))
for target_col in target_cols_all_horizons:
    horizon_name = target_col.replace('t2m_', '').replace('_later', '').replace('_', ' ')
    plot_df_list_bare.append(df_eval_bare[["date", f"y_true_{target_col}"]].assign(series=f"Test Actual ({horizon_name})").rename(columns={f"y_true_{target_col}": "temperature"}))
    plot_df_list_bare.append(df_eval_bare[["date", f"y_pred_{target_col}"]].assign(series=f"Test Pred ({horizon_name})").rename(columns={f"y_pred_{target_col}": "temperature"}))
plot_df_bare = pd.concat(plot_df_list_bare, ignore_index=True)

fig_bare = px.line(plot_df_bare, x="date", y="temperature", color="series", 
                   title=f"XGBoost Bare Bones Baseline for {target_city_name} - T2M Forecasts", 
                   template="plotly_white", color_discrete_map=color_map_plot, line_dash_map=line_dash_map_plot)
fig_bare.update_layout(
    height=600,
    xaxis_title="Date",
    yaxis_title="T2M Mean Temp (°C)",
    legend=dict(
        x=1,           # x=1 is the far right
        y=1,           # y=1 is the top
        xanchor="right",
        yanchor="top",
        bgcolor="rgba(255,255,255,0.7)",  # semi-transparent white background
        bordercolor="black",
        borderwidth=1
    )
)
fig_bare.show()

# --- SECTION 6.1 Plot differences (Bare Bones) ---
plot_error_df_list_bare = []
for target_col in target_cols_all_horizons:
    horizon_name = target_col.replace('t2m_', '').replace('_later', '').replace('_', ' ')
    error_col = f"error_{horizon_name}"
    df_eval_bare[error_col] = df_eval_bare[f"y_true_{target_col}"] - df_eval_bare[f"y_pred_{target_col}"]
    plot_error_df_list_bare.append(df_eval_bare[["date", error_col]].assign(series=f"Error ({horizon_name})").rename(columns={error_col: "error"}))
plot_error_df_bare = pd.concat(plot_error_df_list_bare, ignore_index=True)
fig_error_bare = px.line(plot_error_df_bare, x="date", y="error", color="series", 
                         title=f"Forecast Error for {target_city_name} (Test Set - Bare Bones)", 
                         labels={"error": "Prediction Error (°C)", "date": "Date"}, 
                         template="plotly_white", color_discrete_map=error_color_map_plot)
fig_error_bare.add_hline(y=0, line_dash="dash", line_color="grey", annotation_text="Zero Error")
fig_error_bare.update_layout(height=600)
fig_error_bare.show()

# --- SECTION 6.2 Feature Importance (Bare Bones) ---
for i, target_col in enumerate(target_cols_all_horizons):  # loop over every target horizon
    # Get the specific XGBoost estimator for this target
    model_for_importance = models_bare["multi_output_model"].estimators_[i] 
    feature_importances = model_for_importance.feature_importances_

    importance_df_bare = pd.DataFrame({
        'Feature': bare_bones_feature_cols, # Use the bare_bones_feature_cols defined earlier
        'Importance': feature_importances
    })
    importance_df_bare = importance_df_bare.sort_values(by='Importance', ascending=False)

    # Plot the top 20 feature importances for the current target
    plt.figure(figsize=(12, 8)) 
    sns.barplot(x='Importance', y='Feature', data=importance_df_bare.head(20), palette='viridis')
    plt.title(f'Top 20 Feature Importances for Bare Bones Model (Target: {target_col})')
    plt.xlabel('Importance (F-score or Gain)')
    plt.ylabel('Feature')
    plt.grid(axis='x', linestyle='--', alpha=0.7)
    plt.tight_layout()
    plt.show()

--- Training Bare Bones Model for targets: ['t2m_1_day_later', 't2m_2_days_later', 't2m_3_days_later', 't2m_4_days_later', 't2m_5_days_later'] ---

--- Metrics for Bare Bones Model (Test Set) ---
   Model       Horizon      RMSE       MAE        R2  Max_Error
0  Bare Bones   1 day   2.493334  1.860813  0.912683   9.279320
1  Bare Bones   2 days  3.437763  2.617557  0.834063  12.168999
2  Bare Bones   3 days  3.816496  2.905861  0.795535  13.472501
3  Bare Bones   4 days  3.854804  2.964682  0.791518  13.387121
4  Bare Bones   5 days  3.968461  3.026663  0.778850  14.513976
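The four metrics in these tables can be reproduced directly from their definitions (a quick sanity check on toy values with NumPy; the scikit-learn calls used in the code above compute the same quantities):

```python
import numpy as np

y_true = np.array([10.0, 12.0, 14.0, 16.0])
y_pred = np.array([11.0, 12.0, 13.0, 18.0])

err = y_true - y_pred
rmse = np.sqrt(np.mean(err ** 2))                                  # root mean squared error
mae = np.mean(np.abs(err))                                         # mean absolute error
r2 = 1 - np.sum(err ** 2) / np.sum((y_true - y_true.mean()) ** 2)  # coefficient of determination
max_err = np.max(np.abs(err))                                      # worst single-day miss
```

RMSE punishes large misses more than MAE does, which is why the two diverge as the horizon grows, and Max_Error is worth tracking separately: a model can have a good RMSE while still missing badly on frontal passages.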

4. Model 2: Full Features (Non-Cascading - from 03.621)

This model utilizes all engineered features (derived features, intra-city lags, cross-city lags from all nine surrounding cities, and rolling windows) but trains an independent model for each forecast horizon (via MultiOutputRegressor).
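The cross-city lag columns that dominate some of the importance tables below (e.g. `winchester_t2m_lag_2`, `jasper_ps_lag_1`) follow a simple naming scheme: each neighbor's lagged series, prefixed with the city name. A minimal sketch of that construction, assuming the neighbor frames share the target's date index:

```python
import numpy as np
import pandas as pd

dates = pd.date_range("2021-01-01", periods=8, freq="D")
target = pd.DataFrame({"date": dates, "t2m": np.linspace(10, 17, 8)})

# Stand-ins for two neighboring cities' T2M series
neighbors = {
    "jasper": pd.DataFrame({"date": dates, "t2m": np.linspace(9, 16, 8)}),
    "winchester": pd.DataFrame({"date": dates, "t2m": np.linspace(11, 18, 8)}),
}

# Cross-city lags: each neighbor's past values, prefixed with the city name
for city, cdf in neighbors.items():
    for lag in (1, 2):
        target[f"{city}_t2m_lag_{lag}"] = cdf["t2m"].shift(lag).values
```

Only lagged neighbor values are used, never same-day ones, so the feature set stays causal: everything the model sees on day t was observable by day t-1.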

Show code
from xgboost import XGBRegressor
from sklearn.multioutput import MultiOutputRegressor
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score, max_error

# --- SECTION 4. Train XGBoost Model (Full Features, Non-Cascading) ---
iX_train_full_noncascading = train_df[full_feature_set_cols]
iy_train_full_noncascading = train_df[target_cols_all_horizons]

xgb_base_model_full_noncascading = XGBRegressor(n_estimators=100, learning_rate=0.1, random_state=0)
multi_output_model_full_noncascading = MultiOutputRegressor(estimator=xgb_base_model_full_noncascading, n_jobs=-1)
print(f"\n--- Training Full Features (Non-Cascading) Model for targets: {target_cols_all_horizons} ---")
multi_output_model_full_noncascading.fit(iX_train_full_noncascading, iy_train_full_noncascading)
models_full_noncascading = {"multi_output_model": multi_output_model_full_noncascading} 

# --- SECTION 5. Forecast & Evaluate (Full Features, Non-Cascading) ---
df_eval_full_noncascading = pd.DataFrame({"date": test_df["date"]})
X_test_full_noncascading = test_df[full_feature_set_cols]
y_pred_all_targets_full_noncascading = models_full_noncascading["multi_output_model"].predict(X_test_full_noncascading)

current_model_results = [] 

for i, target_col in enumerate(target_cols_all_horizons):
    df_eval_full_noncascading[f"y_true_{target_col}"] = test_df[target_col]
    df_eval_full_noncascading[f"y_pred_{target_col}"] = y_pred_all_targets_full_noncascading[:, i]

    y_true = df_eval_full_noncascading[f"y_true_{target_col}"]
    y_pred = df_eval_full_noncascading[f"y_pred_{target_col}"]

    rmse = np.sqrt(mean_squared_error(y_true, y_pred))
    mae  = mean_absolute_error(y_true, y_pred)
    r2 = r2_score(y_true, y_pred)
    max_err = max_error(y_true, y_pred)
    
    current_model_results.append({
        'Model': 'Full Features (Non-Cascading)',
        'Horizon': target_col.replace('t2m_', '').replace('_later', '').replace('_', ' '),
        'RMSE': rmse,
        'MAE': mae,
        'R2': r2,
        'Max_Error': max_err
    })

all_model_comparison_results.extend(current_model_results)

print(f"\n--- Metrics for Full Features (Non-Cascading) Model (Test Set) ---")
display(pd.DataFrame(current_model_results))

# --- SECTION 6.0 Plot Results (Full Features, Non-Cascading) ---
plot_df_list_full_noncascading = []
# Actual (today's mean temp) is sourced from df_eval_full_noncascading, so only the test period is plotted
plot_df_list_full_noncascading.append(df_eval_full_noncascading[["date", f"y_true_{target_cols_all_horizons[0]}"]].assign(series="Actual (Today's Mean Temp)").rename(columns={f"y_true_{target_cols_all_horizons[0]}": "temperature"}))
for target_col in target_cols_all_horizons:
    horizon_name = target_col.replace('t2m_', '').replace('_later', '').replace('_', ' ')
    plot_df_list_full_noncascading.append(df_eval_full_noncascading[["date", f"y_true_{target_col}"]].assign(series=f"Test Actual ({horizon_name})").rename(columns={f"y_true_{target_col}": "temperature"}))
    plot_df_list_full_noncascading.append(df_eval_full_noncascading[["date", f"y_pred_{target_col}"]].assign(series=f"Test Pred ({horizon_name})").rename(columns={f"y_pred_{target_col}": "temperature"}))
plot_df_full_noncascading = pd.concat(plot_df_list_full_noncascading, ignore_index=True)

fig_full_noncascading = px.line(plot_df_full_noncascading, x="date", y="temperature", color="series", 
                                title=f"XGBoost Full Features (Non-Cascading) for {target_city_name} - T2M Forecasts", 
                                template="plotly_white", color_discrete_map=color_map_plot, line_dash_map=line_dash_map_plot)
fig_full_noncascading.update_layout(height=600, xaxis_title="Date", yaxis_title="T2M Mean Temp (°C)")
fig_full_noncascading.show()

# --- SECTION 6.1 Plot differences (Full Features, Non-Cascading) ---
plot_error_df_list_full_noncascading = []
for target_col in target_cols_all_horizons:
    horizon_name = target_col.replace('t2m_', '').replace('_later', '').replace('_', ' ')
    error_col = f"error_{horizon_name}"
    df_eval_full_noncascading[error_col] = df_eval_full_noncascading[f"y_true_{target_col}"] - df_eval_full_noncascading[f"y_pred_{target_col}"]
    plot_error_df_list_full_noncascading.append(df_eval_full_noncascading[["date", error_col]].assign(series=f"Error ({horizon_name})").rename(columns={error_col: "error"}))
plot_error_df_full_noncascading = pd.concat(plot_error_df_list_full_noncascading, ignore_index=True)
fig_error_full_noncascading = px.line(plot_error_df_full_noncascading, x="date", y="error", color="series", 
                                     title=f"Forecast Error for {target_city_name} (Test Set - Full Features, Non-Cascading)", 
                                     labels={"error": "Prediction Error (°C)", "date": "Date"}, 
                                     template="plotly_white", color_discrete_map=error_color_map_plot)
fig_error_full_noncascading.add_hline(y=0, line_dash="dash", line_color="grey", annotation_text="Zero Error")
fig_error_full_noncascading.update_layout(height=600)
fig_error_full_noncascading.show()

# --- SECTION 6.2 Feature Importance (Full Features, Non-Cascading) ---
print(f"\n--- Feature Importances for Full Features (Non-Cascading) Model ---")
for i, target_col in enumerate(target_cols_all_horizons):
    model_for_importance_full_noncascading = models_full_noncascading["multi_output_model"].estimators_[i]
    feature_importances_full_noncascading = model_for_importance_full_noncascading.feature_importances_
    importance_df_full_noncascading = pd.DataFrame({'Feature': full_feature_set_cols, 'Importance': feature_importances_full_noncascading})
    importance_df_full_noncascading = importance_df_full_noncascading.sort_values(by='Importance', ascending=False)
    print(f"\nFeature Importances for {target_col} (Full Features, Non-Cascading):")
    print(importance_df_full_noncascading.head(20))
    plt.figure(figsize=(12, 8))
    sns.barplot(x='Importance', y='Feature', data=importance_df_full_noncascading.head(20), palette='viridis')
    plt.title(f'Top 20 Feature Importances for {target_col} (Full Features, Non-Cascading)')
    plt.xlabel('Importance (F-score or Gain)')
    plt.ylabel('Feature')
    plt.grid(axis='x', linestyle='--', alpha=0.7)
    plt.tight_layout()
    plt.show()

--- Training Full Features (Non-Cascading) Model for targets: ['t2m_1_day_later', 't2m_2_days_later', 't2m_3_days_later', 't2m_4_days_later', 't2m_5_days_later'] ---

--- Metrics for Full Features (Non-Cascading) Model (Test Set) ---
   Model                          Horizon      RMSE       MAE        R2  Max_Error
0  Full Features (Non-Cascading)   1 day   2.011463  1.489929  0.943172   9.373296
1  Full Features (Non-Cascading)   2 days  3.084712  2.352502  0.866396  11.284673
2  Full Features (Non-Cascading)   3 days  3.617922  2.787716  0.816258  13.461348
3  Full Features (Non-Cascading)   4 days  3.716141  2.879343  0.806247  13.819242
4  Full Features (Non-Cascading)   5 days  3.929810  3.053671  0.783137  14.496209

--- Feature Importances for Full Features (Non-Cascading) Model ---

Feature Importances for t2m_1_day_later (Full Features, Non-Cascading):
                                   Feature  Importance
18                    t2m_avg_from_max_min    0.727284
6                                  t2m_max    0.076977
4                                      t2m    0.046108
3                                  doy_cos    0.004483
15                             prectotcorr    0.003251
11                                   wd10m    0.003026
20                                  wind_u    0.002754
184  winchester_t2m_avg_from_max_min_lag_1    0.002551
10                                   ws10m    0.002284
217                   fort payne_t2m_lag_2    0.002019
19                          ps_change_24hr    0.001876
174      jasper_t2m_avg_from_max_min_lag_1    0.001668
127             t2m_avg_from_max_min_lag_7    0.001659
12                                      ps    0.001614
22                    t2m_range_normalized    0.001604
5                                  t2m_min    0.001582
21                                  wind_v    0.001524
124             t2m_avg_from_max_min_lag_4    0.001448
13                           allsky_sw_dwn    0.001372
207                       athens_t2m_lag_2    0.001315


Feature Importances for t2m_2_days_later (Full Features, Non-Cascading):
                               Feature  Importance
6                              t2m_max    0.461344
18                t2m_avg_from_max_min    0.119644
4                                  t2m    0.054304
3                              doy_cos    0.028003
264  t2m_avg_from_max_min_roll_mean_7d    0.011241
248                   t2m_roll_mean_7d    0.008950
175  jasper_t2m_avg_from_max_min_lag_2    0.008186
127         t2m_avg_from_max_min_lag_7    0.007067
174  jasper_t2m_avg_from_max_min_lag_1    0.005733
12                                  ps    0.005435
123         t2m_avg_from_max_min_lag_3    0.004674
168                    jasper_ps_lag_1    0.004597
166                   jasper_t2m_lag_1    0.004152
246                   t2m_roll_mean_3d    0.003866
10                               ws10m    0.003383
158                scottsboro_ps_lag_1    0.003150
15                         prectotcorr    0.003064
2                              doy_sin    0.003025
227                   dayton_t2m_lag_2    0.002885
20                              wind_u    0.002863


Feature Importances for t2m_3_days_later (Full Features, Non-Cascading):
                               Feature  Importance
3                              doy_cos    0.289576
248                   t2m_roll_mean_7d    0.230206
4                                  t2m    0.032064
6                              t2m_max    0.027875
18                t2m_avg_from_max_min    0.024910
264  t2m_avg_from_max_min_roll_mean_7d    0.013331
246                   t2m_roll_mean_3d    0.012552
2                              doy_sin    0.007931
262  t2m_avg_from_max_min_roll_mean_3d    0.007714
177               winchester_t2m_lag_2    0.006007
12                                  ps    0.005631
29                           t2m_lag_7    0.005385
127         t2m_avg_from_max_min_lag_7    0.005090
1                          day_of_year    0.004884
226                   dayton_t2m_lag_1    0.004339
126         t2m_avg_from_max_min_lag_6    0.004036
252              dewpoint_roll_mean_7d    0.003414
227                   dayton_t2m_lag_2    0.003358
168                    jasper_ps_lag_1    0.003138
28                           t2m_lag_6    0.003057


Feature Importances for t2m_4_days_later (Full Features, Non-Cascading):
                                   Feature  Importance
3                                  doy_cos    0.308031
248                       t2m_roll_mean_7d    0.134334
264      t2m_avg_from_max_min_roll_mean_7d    0.034468
262      t2m_avg_from_max_min_roll_mean_3d    0.023982
18                    t2m_avg_from_max_min    0.017123
122             t2m_avg_from_max_min_lag_2    0.016080
6                                  t2m_max    0.014836
4                                      t2m    0.009422
2                                  doy_sin    0.007194
185  winchester_t2m_avg_from_max_min_lag_2    0.007005
1                              day_of_year    0.006551
224  fort payne_t2m_avg_from_max_min_lag_1    0.004471
119              t2m_dewpoint_spread_lag_6    0.004425
12                                      ps    0.004082
256                        ps_roll_mean_7d    0.004043
217                   fort payne_t2m_lag_2    0.003997
126             t2m_avg_from_max_min_lag_6    0.003858
218                    fort payne_ps_lag_1    0.003834
179                    winchester_ps_lag_2    0.003741
28                               t2m_lag_6    0.003275


Feature Importances for t2m_5_days_later (Full Features, Non-Cascading):
                                   Feature  Importance
3                                  doy_cos    0.304646
248                       t2m_roll_mean_7d    0.159524
262      t2m_avg_from_max_min_roll_mean_3d    0.032932
125             t2m_avg_from_max_min_lag_5    0.009109
6                                  t2m_max    0.008415
184  winchester_t2m_avg_from_max_min_lag_1    0.007402
215      athens_t2m_avg_from_max_min_lag_2    0.006810
2                                  doy_sin    0.006574
1                              day_of_year    0.006352
246                       t2m_roll_mean_3d    0.005623
121             t2m_avg_from_max_min_lag_1    0.005614
124             t2m_avg_from_max_min_lag_4    0.005467
218                    fort payne_ps_lag_1    0.004774
264      t2m_avg_from_max_min_roll_mean_7d    0.004419
252                  dewpoint_roll_mean_7d    0.004118
26                               t2m_lag_4    0.004058
256                        ps_roll_mean_7d    0.004041
18                    t2m_avg_from_max_min    0.004028
214      athens_t2m_avg_from_max_min_lag_1    0.003999
235      dayton_t2m_avg_from_max_min_lag_2    0.003770

5. Model 3: Cascading Forecasts (03.650)

This model builds on the full feature set (derived features, intra-city lags, cross-city lags from all nine surrounding cities, and rolling windows) but uses a sequential training strategy, in which predictions from earlier horizons are fed as features to later horizons. One caveat: the cascaded prediction features for the training set are in-sample predictions, which tend to be more accurate than anything available at test time, so later horizons may over-trust them.
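The cascade reduces to a simple loop: fit horizon 1, append its predictions as an extra feature column, fit horizon 2 on the widened matrix, and so on. A minimal NumPy sketch (ordinary least squares stands in for XGBRegressor; the feature-count growth mirrors the 266 → 270 reported in the training log):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 6))   # base feature matrix (6 stand-in features)
Y = rng.normal(size=(100, 5))   # one target column per horizon

def fit_predict(features, y):
    # Least squares with intercept, standing in for an XGBRegressor fit+predict
    Xb = np.column_stack([features, np.ones(len(features))])
    w, *_ = np.linalg.lstsq(Xb, y, rcond=None)
    return Xb @ w

features = X.copy()
feature_counts = []
for h in range(5):
    feature_counts.append(features.shape[1])
    preds = fit_predict(features, Y[:, h])
    # Predictions for horizon h become a feature for horizon h+1
    features = np.column_stack([features, preds])
```

At inference time the same loop runs on the test matrix, using each model's own test-set predictions as the appended columns, exactly as the code below does with `test_pred_features_df_cascading`.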

Show code
from xgboost import XGBRegressor
from sklearn.base import clone 

# --- SECTION 4. Train Cascading XGBoost Models ---
models_cascading = {} 
train_pred_features_df_cascading = pd.DataFrame(index=train_df.index) 
test_pred_features_df_cascading = pd.DataFrame(index=test_df.index)

xgb_base_model_cascading = XGBRegressor(n_estimators=100, learning_rate=0.1, random_state=0)

current_feature_cols_cascading_training = list(full_feature_set_cols) 

for i, target_col in enumerate(target_cols_all_horizons):
    print(f"\n--- Training Cascading Model for {target_col} ---")
    iX_train_cascading = train_df[current_feature_cols_cascading_training]
    iy_train_cascading = train_df[target_col]
    
    model_cascading = clone(xgb_base_model_cascading) 
    
    print(f"Features used for {target_col}: {len(current_feature_cols_cascading_training)} features")
    model_cascading.fit(iX_train_cascading, iy_train_cascading)
    models_cascading[target_col] = model_cascading 

    train_preds_cascading = model_cascading.predict(train_df[current_feature_cols_cascading_training])
    new_pred_feature_name = f"predicted_{target_col}"
    train_pred_features_df_cascading[new_pred_feature_name] = train_preds_cascading
    
    test_preds_cascading = model_cascading.predict(test_df[current_feature_cols_cascading_training])
    test_pred_features_df_cascading[new_pred_feature_name] = test_preds_cascading

    if i < len(target_cols_all_horizons) - 1:
        current_feature_cols_cascading_training.append(new_pred_feature_name)
        train_df[new_pred_feature_name] = train_pred_features_df_cascading[new_pred_feature_name]
        test_df[new_pred_feature_name] = test_pred_features_df_cascading[new_pred_feature_name]

print("\nAll cascading models trained.")

# --- SECTION 5. Forecast & Evaluate (Cascading) ---
df_eval_cascading = pd.DataFrame({"date": test_df["date"]})

current_model_results = [] 

for target_col in target_cols_all_horizons:
    df_eval_cascading[f"y_true_{target_col}"] = test_df[target_col]
    df_eval_cascading[f"y_pred_{target_col}"] = test_pred_features_df_cascading[f"predicted_{target_col}"]

    y_true = df_eval_cascading[f"y_true_{target_col}"]
    y_pred = df_eval_cascading[f"y_pred_{target_col}"]

    rmse = np.sqrt(mean_squared_error(y_true, y_pred))
    mae  = mean_absolute_error(y_true, y_pred)
    r2 = r2_score(y_true, y_pred)
    max_err = max_error(y_true, y_pred)
    
    current_model_results.append({
        'Model': 'Cascading',
        'Horizon': target_col.replace('t2m_', '').replace('_later', '').replace('_', ' '),
        'RMSE': rmse,
        'MAE': mae,
        'R2': r2,
        'Max_Error': max_err
    })

all_model_comparison_results.extend(current_model_results)

print(f"\n--- Metrics for Cascading Model (Test Set) ---")
display(pd.DataFrame(current_model_results))

# --- SECTION 6.0 Plot Results (Cascading) ---
plot_df_list_cascading = []
# Actual (today's mean temp) is sourced from df_eval_cascading, so only the test period is plotted
plot_df_list_cascading.append(df_eval_cascading[["date", f"y_true_{target_cols_all_horizons[0]}"]].assign(series="Actual (Today's Mean Temp)").rename(columns={f"y_true_{target_cols_all_horizons[0]}": "temperature"}))
for target_col in target_cols_all_horizons:
    horizon_name = target_col.replace('t2m_', '').replace('_later', '').replace('_', ' ')
    plot_df_list_cascading.append(df_eval_cascading[["date", f"y_true_{target_col}"]].assign(series=f"Test Actual ({horizon_name})").rename(columns={f"y_true_{target_col}": "temperature"}))
    plot_df_list_cascading.append(df_eval_cascading[["date", f"y_pred_{target_col}"]].assign(series=f"Test Pred ({horizon_name})").rename(columns={f"y_pred_{target_col}": "temperature"}))
plot_df_cascading = pd.concat(plot_df_list_cascading, ignore_index=True)

fig_cascading = px.line(plot_df_cascading, x="date", y="temperature", color="series", 
                        title=f"XGBoost Cascading Forecasts for {target_city_name} - T2M Forecasts", 
                        template="plotly_white", color_discrete_map=color_map_plot, line_dash_map=line_dash_map_plot)
fig_cascading.update_layout(height=600, xaxis_title="Date", yaxis_title="T2M Mean Temp (°C)")
fig_cascading.show()

# --- SECTION 6.1 Plot differences (Cascading) ---
plot_error_df_list_cascading = []
for target_col in target_cols_all_horizons:
    horizon_name = target_col.replace('t2m_', '').replace('_later', '').replace('_', ' ')
    error_col = f"error_{horizon_name}"
    df_eval_cascading[error_col] = df_eval_cascading[f"y_true_{target_col}"] - df_eval_cascading[f"y_pred_{target_col}"]
    plot_error_df_list_cascading.append(df_eval_cascading[["date", error_col]].assign(series=f"Error ({horizon_name})").rename(columns={error_col: "error"}))
plot_error_df_cascading = pd.concat(plot_error_df_list_cascading, ignore_index=True)
fig_error_cascading = px.line(plot_error_df_cascading, x="date", y="error", color="series", 
                              title=f"Forecast Error for {target_city_name} (Test Set - Cascading)", 
                              labels={"error": "Prediction Error (°C)", "date": "Date"}, 
                              template="plotly_white", color_discrete_map=error_color_map_plot)
fig_error_cascading.add_hline(y=0, line_dash="dash", line_color="grey", annotation_text="Zero Error")
fig_error_cascading.update_layout(height=600)
fig_error_cascading.show()

# --- SECTION 6.2 Feature Importance (Cascading) ---
print(f"\n--- Feature Importances for Cascading Models ---")
for i, target_col in enumerate(target_cols_all_horizons):
    model_for_importance_cascading = models_cascading[target_col]
    feature_importances_cascading = model_for_importance_cascading.feature_importances_
    
    model_specific_feature_cols_cascading = list(full_feature_set_cols) 
    for prev_i in range(i):
        model_specific_feature_cols_cascading.append(f"predicted_{target_cols_all_horizons[prev_i]}")

    importance_df_cascading = pd.DataFrame({'Feature': model_specific_feature_cols_cascading, 'Importance': feature_importances_cascading})
    importance_df_cascading = importance_df_cascading.sort_values(by='Importance', ascending=False)
    print(f"\nFeature Importances for {target_col} (Cascading):")
    print(importance_df_cascading.head(20))
    plt.figure(figsize=(12, 8))
    sns.barplot(x='Importance', y='Feature', data=importance_df_cascading.head(20), palette='viridis')
    plt.title(f'Top 20 Feature Importances for {target_col} (Cascading)')
    plt.xlabel('Importance (F-score or Gain)')
    plt.ylabel('Feature')
    plt.grid(axis='x', linestyle='--', alpha=0.7)
    plt.tight_layout()
    plt.show()

--- Training Cascading Model for t2m_1_day_later ---
Features used for t2m_1_day_later: 266 features

--- Training Cascading Model for t2m_2_days_later ---
Features used for t2m_2_days_later: 267 features

--- Training Cascading Model for t2m_3_days_later ---
Features used for t2m_3_days_later: 268 features

--- Training Cascading Model for t2m_4_days_later ---
Features used for t2m_4_days_later: 269 features

--- Training Cascading Model for t2m_5_days_later ---
Features used for t2m_5_days_later: 270 features

All cascading models trained.

--- Metrics for Cascading Model (Test Set) ---
   Model      Horizon      RMSE       MAE        R2  Max_Error
0  Cascading   1 day   2.011463  1.489929  0.943172   9.373296
1  Cascading   2 days  3.142463  2.382870  0.861346  12.095691
2  Cascading   3 days  3.671489  2.797543  0.810777  14.031865
3  Cascading   4 days  3.939334  3.025852  0.782275  13.395379
4  Cascading   5 days  4.013391  3.094258  0.773814  13.291235
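Since every model appends its per-horizon result dicts to all_model_comparison_results, the final side-by-side comparison is a single pivot (a sketch with an illustrative subset of the numbers above, rounded):

```python
import pandas as pd

# Illustrative subset of the accumulated result dicts
all_model_comparison_results = [
    {"Model": "Bare Bones", "Horizon": "1 day", "RMSE": 2.49},
    {"Model": "Bare Bones", "Horizon": "2 days", "RMSE": 3.44},
    {"Model": "Cascading", "Horizon": "1 day", "RMSE": 2.01},
    {"Model": "Cascading", "Horizon": "2 days", "RMSE": 3.14},
]

comparison_df = pd.DataFrame(all_model_comparison_results)
# Rows = forecast horizon, columns = model, cells = RMSE
rmse_table = comparison_df.pivot(index="Horizon", columns="Model", values="RMSE")
```

The same pivot works for MAE, R2, or Max_Error by changing `values=`, which makes it easy to see that the cascading model matches the non-cascading one at 1 day (they are the same model there) and falls slightly behind at longer horizons.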

--- Feature Importances for Cascading Models ---

Feature Importances for t2m_1_day_later (Cascading):
                                   Feature  Importance
18                    t2m_avg_from_max_min    0.727284
6                                  t2m_max    0.076977
4                                      t2m    0.046108
3                                  doy_cos    0.004483
15                             prectotcorr    0.003251
11                                   wd10m    0.003026
20                                  wind_u    0.002754
184  winchester_t2m_avg_from_max_min_lag_1    0.002551
10                                   ws10m    0.002284
217                   fort payne_t2m_lag_2    0.002019
19                          ps_change_24hr    0.001876
174      jasper_t2m_avg_from_max_min_lag_1    0.001668
127             t2m_avg_from_max_min_lag_7    0.001659
12                                      ps    0.001614
22                    t2m_range_normalized    0.001604
5                                  t2m_min    0.001582
21                                  wind_v    0.001524
124             t2m_avg_from_max_min_lag_4    0.001448
13                           allsky_sw_dwn    0.001372
207                       athens_t2m_lag_2    0.001315


Feature Importances for t2m_2_days_later (Cascading):
                               Feature  Importance
266          predicted_t2m_1_day_later    0.435157
3                              doy_cos    0.011623
21                              wind_v    0.010863
12                                  ps    0.008163
177               winchester_t2m_lag_2    0.007976
51                      dewpoint_lag_1    0.006614
8                             dewpoint    0.006485
5                              t2m_min    0.005716
18                t2m_avg_from_max_min    0.005442
207                   athens_t2m_lag_2    0.005159
17                 t2m_dewpoint_spread    0.004861
14                       allsky_lw_dwn    0.004619
122         t2m_avg_from_max_min_lag_2    0.004476
9                                 rh2m    0.004461
22                t2m_range_normalized    0.004362
214  athens_t2m_avg_from_max_min_lag_1    0.004241
19                      ps_change_24hr    0.004136
127         t2m_avg_from_max_min_lag_7    0.004087
119          t2m_dewpoint_spread_lag_6    0.004034
222            fort payne_wind_v_lag_1    0.004012


Feature Importances for t2m_3_days_later (Cascading):
                                   Feature  Importance
267             predicted_t2m_2_days_later    0.411059
3                                  doy_cos    0.008865
266              predicted_t2m_1_day_later    0.008116
227                       dayton_t2m_lag_2    0.007661
166                       jasper_t2m_lag_1    0.006897
168                        jasper_ps_lag_1    0.005644
246                       t2m_roll_mean_3d    0.005621
167                       jasper_t2m_lag_2    0.005546
217                   fort payne_t2m_lag_2    0.005324
224  fort payne_t2m_avg_from_max_min_lag_1    0.005315
214      athens_t2m_avg_from_max_min_lag_1    0.005199
235      dayton_t2m_avg_from_max_min_lag_2    0.005177
177                   winchester_t2m_lag_2    0.005113
110             t2m_ratio_to_t2m_min_lag_4    0.004836
127             t2m_avg_from_max_min_lag_7    0.004367
38                           t2m_max_lag_2    0.004304
225  fort payne_t2m_avg_from_max_min_lag_2    0.004264
185  winchester_t2m_avg_from_max_min_lag_2    0.003921
176                   winchester_t2m_lag_1    0.003894
219                    fort payne_ps_lag_2    0.003882


Feature Importances for t2m_4_days_later (Cascading):
                                   Feature  Importance
268             predicted_t2m_3_days_later    0.399316
3                                  doy_cos    0.011178
234      dayton_t2m_avg_from_max_min_lag_1    0.008493
167                       jasper_t2m_lag_2    0.008037
262      t2m_avg_from_max_min_roll_mean_3d    0.007185
267             predicted_t2m_2_days_later    0.006732
18                    t2m_avg_from_max_min    0.006692
185  winchester_t2m_avg_from_max_min_lag_2    0.006123
246                       t2m_roll_mean_3d    0.005963
176                   winchester_t2m_lag_1    0.005246
215      athens_t2m_avg_from_max_min_lag_2    0.004772
122             t2m_avg_from_max_min_lag_2    0.004666
266              predicted_t2m_1_day_later    0.004643
126             t2m_avg_from_max_min_lag_6    0.004401
224  fort payne_t2m_avg_from_max_min_lag_1    0.004004
169                        jasper_ps_lag_2    0.003795
125             t2m_avg_from_max_min_lag_5    0.003767
264      t2m_avg_from_max_min_roll_mean_7d    0.003662
33                           t2m_min_lag_4    0.003649
89                     allsky_sw_dwn_lag_4    0.003552


Feature Importances for t2m_5_days_later (Cascading):
                                   Feature  Importance
269             predicted_t2m_4_days_later    0.405377
3                                  doy_cos    0.015155
248                       t2m_roll_mean_7d    0.010913
268             predicted_t2m_3_days_later    0.007352
207                       athens_t2m_lag_2    0.006097
267             predicted_t2m_2_days_later    0.006063
122             t2m_avg_from_max_min_lag_2    0.005713
126             t2m_avg_from_max_min_lag_6    0.005179
266              predicted_t2m_1_day_later    0.004981
184  winchester_t2m_avg_from_max_min_lag_1    0.004843
235      dayton_t2m_avg_from_max_min_lag_2    0.004669
27                               t2m_lag_5    0.004618
252                  dewpoint_roll_mean_7d    0.004489
18                    t2m_avg_from_max_min    0.004313
53                          dewpoint_lag_3    0.004147
42                           t2m_max_lag_6    0.004098
216                   fort payne_t2m_lag_1    0.003874
167                       jasper_t2m_lag_2    0.003843
262      t2m_avg_from_max_min_roll_mean_3d    0.003829
157                   scottsboro_t2m_lag_2    0.003787

6. Model 4: Native Multi-Output

This model uses the full feature set with XGBoost’s native multi-output tree strategy, aiming to learn interdependencies between all 5 horizons in a single, unified model.

Show code
from xgboost import XGBRegressor

# --- SECTION 4. Train XGBoost Model (Native Multi-Output) ---
X_train_native = train_df[full_feature_set_cols]
y_train_native = train_df[target_cols_all_horizons]

xgb_native_multi_output_model = XGBRegressor(
    n_estimators=100,
    learning_rate=0.1,
    random_state=0,
    tree_method="hist",                  # histogram splits; required for multi-output trees
    multi_strategy="multi_output_tree"   # one tree predicts all 5 horizons jointly
)

print(f"\n--- Training Native Multi-Output Model for targets: {target_cols_all_horizons} ---")
xgb_native_multi_output_model.fit(X_train_native, y_train_native)
models_native_multioutput = {"multi_output_model": xgb_native_multi_output_model}

# --- SECTION 5. Forecast & Evaluate (Native Multi-Output) ---
df_eval_native_multioutput = pd.DataFrame({"date": test_df["date"]})
X_test_native = test_df[full_feature_set_cols]
y_pred_all_targets_native = models_native_multioutput["multi_output_model"].predict(X_test_native)

current_model_results = [] 

for i, target_col in enumerate(target_cols_all_horizons):
    df_eval_native_multioutput[f"y_true_{target_col}"] = test_df[target_col]
    df_eval_native_multioutput[f"y_pred_{target_col}"] = y_pred_all_targets_native[:, i]

    y_true = df_eval_native_multioutput[f"y_true_{target_col}"]
    y_pred = df_eval_native_multioutput[f"y_pred_{target_col}"]

    rmse = np.sqrt(mean_squared_error(y_true, y_pred))
    mae  = mean_absolute_error(y_true, y_pred)
    r2 = r2_score(y_true, y_pred)
    max_err = max_error(y_true, y_pred)
    
    print(f"\nMetrics for {target_col} (Native Multi-Output):")
    print(f"  RMSE: {rmse:.3f}, MAE: {mae:.3f}, R2: {r2:.3f}, Max Error: {max_err:.3f}")

    current_model_results.append({
        'Model': 'Native Multi-Output',
        'Horizon': target_col.replace('t2m_', '').replace('_later', '').replace('_', ' '),
        'RMSE': rmse,
        'MAE': mae,
        'R2': r2,
        'Max_Error': max_err
    })

all_model_comparison_results.extend(current_model_results)

print(f"\n--- Metrics for Native Multi-Output Model (Test Set) ---")
display(pd.DataFrame(current_model_results))

# --- SECTION 6.0 Plot Results (Native Multi-Output) ---
plot_df_list_native = []
plot_df_list_native.append(df_final[["date", "t2m"]].assign(series="Actual (Today's Mean Temp)").rename(columns={"t2m": "temperature"}))
for target_col in target_cols_all_horizons:
    horizon_name = target_col.replace('t2m_', '').replace('_later', '').replace('_', ' ')
    plot_df_list_native.append(train_df[["date", target_col]].assign(series=f"Train (t2m_{horizon_name})").rename(columns={target_col: "temperature"}))
    plot_df_list_native.append(df_eval_native_multioutput[["date", f"y_true_{target_col}"]].assign(series=f"Test Actual ({horizon_name})").rename(columns={f"y_true_{target_col}": "temperature"}))
    plot_df_list_native.append(df_eval_native_multioutput[["date", f"y_pred_{target_col}"]].assign(series=f"Test Pred ({horizon_name})").rename(columns={f"y_pred_{target_col}": "temperature"}))
plot_df_native = pd.concat(plot_df_list_native, ignore_index=True)

fig_native = px.line(plot_df_native, x="date", y="temperature", color="series", 
                     title=f"XGBoost Native Multi-Output for {target_city_name} - T2M Forecasts", 
                     template="plotly_white", color_discrete_map=color_map_plot, line_dash_map=line_dash_map_plot)
fig_native.update_layout(height=600, xaxis_title="Date", yaxis_title="T2M Mean Temp (°C)")
fig_native.show()

# --- SECTION 6.1 Plot differences (Native Multi-Output) ---
plot_error_df_list_native = []
for target_col in target_cols_all_horizons:
    horizon_name = target_col.replace('t2m_', '').replace('_later', '').replace('_', ' ')
    error_col = f"error_{horizon_name}"
    df_eval_native_multioutput[error_col] = df_eval_native_multioutput[f"y_true_{target_col}"] - df_eval_native_multioutput[f"y_pred_{target_col}"]
    plot_error_df_list_native.append(df_eval_native_multioutput[["date", error_col]].assign(series=f"Error ({horizon_name})").rename(columns={error_col: "error"}))
plot_error_df_native = pd.concat(plot_error_df_list_native, ignore_index=True)
fig_error_native = px.line(plot_error_df_native, x="date", y="error", color="series", 
                           title=f"Forecast Error for {target_city_name} (Test Set - Native Multi-Output)", 
                           labels={"error": "Prediction Error (°C)", "date": "Date"}, 
                           template="plotly_white", color_discrete_map=error_color_map_plot)
fig_error_native.add_hline(y=0, line_dash="dash", line_color="grey", annotation_text="Zero Error")
fig_error_native.update_layout(height=600)
fig_error_native.show()

# --- SECTION 6.2 Feature Importance (Native Multi-Output) ---
print(f"\n--- Feature Importances for Native Multi-Output Model ---")
# Importance plotting is skipped here: the native multi-output tree strategy does not expose .feature_importances_
print("NOTE: Feature importances for native multi-output trees (multi_strategy='multi_output_tree')")
print("are not directly available via .feature_importances_ in current XGBoost versions (due to internal gain calculation limitations).")

--- Training Native Multi-Output Model for targets: ['t2m_1_day_later', 't2m_2_days_later', 't2m_3_days_later', 't2m_4_days_later', 't2m_5_days_later'] ---

Metrics for t2m_1_day_later (Native Multi-Output):
  RMSE: 2.092, MAE: 1.592, R2: 0.939, Max Error: 9.345

Metrics for t2m_2_days_later (Native Multi-Output):
  RMSE: 3.080, MAE: 2.364, R2: 0.867, Max Error: 11.512

Metrics for t2m_3_days_later (Native Multi-Output):
  RMSE: 3.527, MAE: 2.716, R2: 0.825, Max Error: 12.744

Metrics for t2m_4_days_later (Native Multi-Output):
  RMSE: 3.725, MAE: 2.874, R2: 0.805, Max Error: 13.645

Metrics for t2m_5_days_later (Native Multi-Output):
  RMSE: 3.814, MAE: 2.927, R2: 0.796, Max Error: 15.915

--- Metrics for Native Multi-Output Model (Test Set) ---
                 Model Horizon      RMSE       MAE        R2  Max_Error
0  Native Multi-Output   1 day  2.092171  1.592071  0.938520   9.345047
1  Native Multi-Output  2 days  3.079716  2.364106  0.866828  11.512363
2  Native Multi-Output  3 days  3.526775  2.715950  0.825400  12.744113
3  Native Multi-Output  4 days  3.725496  2.874296  0.805271  13.645490
4  Native Multi-Output  5 days  3.814167  2.926581  0.795712  15.915308

--- Feature Importances for Native Multi-Output Model ---
NOTE: Feature importances for native multi-output trees (multi_strategy='multi_output_tree')
are not directly available via .feature_importances_ in current XGBoost versions (due to internal gain calculation limitations).

8. Conclusion and Future Work

Conclusion

Of the multi-horizon strategies compared here, the cascading model is the most accurate at the 1-day horizon (test RMSE 2.011 °C vs 2.092 °C for the native multi-output model), while the native multi-output model wins at every longer horizon (RMSE 3.814 °C vs 4.013 °C at 5 days, with corresponding gains in MAE and R²). The cascading feature importances suggest why: each horizon leans heavily (importance ≈ 0.40) on the previous horizon's prediction, which sharpens short-range forecasts but lets errors compound as the horizon grows, whereas the jointly trained multi-output trees degrade more gracefully.

Future Work

Natural next steps include hyperparameter tuning for the best-performing model, exploring additional engineered features, evaluating other model families, and implementing true out-of-fold (walk-forward) validation so that the cascaded prediction features are generated out-of-sample rather than in-sample.